Spike: Computational cost

This spike is intended to provide ideas on how to make your code more efficient. The data used is the WIFI dataset.

It covers four techniques: smart sampling, parallel processing, modeling without caret, and optimization of random forest (mtry).

SMART SAMPLES

First, imagine you want to sample the data to try different models faster. You could use the function sample_n, but you would incur the risk of an unrepresentative sample (such as all observations coming from building 0).

Load data

Sample data
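A stratified sample avoids that risk by sampling within each group. A minimal sketch with dplyr, using the built-in iris data as a stand-in for the WIFI set (Species plays the role of the floor/building column):

```r
# Sketch: stratified sampling with dplyr. iris stands in for the WIFI
# data; Species stands in for the floor/building column.
library(dplyr)

set.seed(123)                    # make the sample reproducible
iris_sample <- iris %>%
  group_by(Species) %>%          # one group per stratum
  sample_n(size = 10) %>%        # equal number of rows from each stratum
  ungroup()

table(iris_sample$Species)       # every stratum contributes 10 rows
```

Grouping before sampling guarantees every stratum is represented, which plain sample_n on the whole data frame does not.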

check frequency floor

## 
##  0  1  2  3  4 
## 30 30 30 30 10

check frequency building

## 
##  0  1  2 
## 40 40 50
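Counts like the ones above come from plain frequency tables. A stand-in sketch (Species plays the role of the FLOOR or BUILDINGID column, which are assumed names):

```r
# Sketch: frequency check with table(); on the real data you would pass
# the sampled floor / building columns instead.
table(iris$Species)              # absolute counts per class
prop.table(table(iris$Species))  # the same check as proportions
```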

plot sample - Building 0, Building 1, Building 2

SPECIFIC PACKAGES

Random Forest: package randomForest

This is the most commonly used package for training a random forest. It’s very user friendly and robust. If you want to learn more about other packages, check this resource.

Let’s see which are the main parameters of the function randomForest:
  • ntree: number of trees to grow
  • mtry: number of variables randomly sampled as candidates at each split
  • importance: should importance of predictors be assessed? Keep in mind that if your data includes categorical variables with different number of levels, random forests are biased in favor of those variables with more levels.

Another useful function from this package is tuneRF(). Starting from the default value of mtry, it searches for the optimal value.
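A minimal sketch of that search, using iris as a stand-in for the WIFI data (replace the predictors and response with your own):

```r
# Sketch: tuning mtry with tuneRF, then training with the best value found.
library(randomForest)

set.seed(123)
res <- tuneRF(x = iris[, 1:4], y = iris$Species,
              ntreeTry   = 100,   # trees grown per candidate mtry
              stepFactor = 1.5,   # multiply/divide mtry by this each step
              improve    = 0.01,  # minimum OOB improvement to keep searching
              trace = TRUE, plot = FALSE)

best_mtry <- res[which.min(res[, "OOBError"]), "mtry"]
rf <- randomForest(x = iris[, 1:4], y = iris$Species,
                   ntree = 500, mtry = best_mtry, importance = TRUE)
```

tuneRF returns a matrix of the mtry values tried and their out-of-bag errors, so picking the row with the smallest OOBError gives the tuned value.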

Your turn! Try to obtain the best mtry for your data and train a random forest using this package and the caret package.

PARALLEL PROCESSING

A computer usually has multiple cores. Typically, R uses only one of them, but we can register more, allowing us to execute more computations at the same time.

How to do it on Windows
  • Install the doParallel package
  • Check how many cores you have with the function detectCores()
  • Create a cluster with the number of cores you want to use via the function makeCluster(). A good practice is to leave one core free for other tasks.
  • Register the cluster with the function registerDoParallel()
How to do it on Mac/Linux
  • Install the doMC package
  • Check how many cores you have with the function detectCores()
  • Register the number of cores you want to use with the function registerDoMC(). A good practice is to leave one core free for other tasks.
  • Confirm the registered workers with the function getDoParWorkers()
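The Windows recipe can be sketched as follows (on Mac/Linux, registerDoMC(cores = n_cores - 1) replaces the cluster steps):

```r
# Sketch: registering a parallel backend with doParallel (Windows-friendly).
library(doParallel)

n_cores <- detectCores()        # cores reported by the machine
cl <- makeCluster(n_cores - 1)  # leave one core free for other tasks
registerDoParallel(cl)          # register the workers for foreach/caret

getDoParWorkers()               # confirm how many workers are registered

# ... train your models here ...

stopCluster(cl)                 # release the workers when finished
```

Remember to call stopCluster() when you are done; otherwise the worker processes keep running in the background.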

Now you can apply parallel processing! For example, caret will run the cross-validation resamples on the registered workers when you set “allowParallel = TRUE” in trainControl().
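A hedged sketch of the caret side, assuming a backend is already registered (iris stands in for the sampled WIFI data):

```r
# Sketch: parallel cross-validation in caret via allowParallel = TRUE.
library(caret)

ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
fit  <- train(Species ~ ., data = iris, method = "rf",
              trControl = ctrl, tuneLength = 2)
```

With allowParallel = TRUE, caret sends the resampling iterations to whatever backend is registered; if none is registered, it simply runs sequentially.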

Challenge: Train the same sample with parallel processing

SAVING AND LOADING MODELS

You can save your best models to a file. This way, you will be able to load/share them later.
  • For saving a model you can mainly use two functions: save(____.rda) or saveRDS(____.rds)
  • For loading a model you will need to use load(____.rda) or readRDS(____.rds)
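A self-contained sketch of both idioms, with a small lm() model standing in for your trained model:

```r
# Sketch: the two save/load idioms, shown on a stand-in model.
model <- lm(dist ~ speed, data = cars)  # cars is a built-in demo dataset

saveRDS(model, "model.rds")             # serialize a single object
model2 <- readRDS("model.rds")          # assign to any name on load

save(model, file = "model.rda")         # stores the object under its name
load("model.rda")                       # restores `model` into the workspace
```

The practical difference: readRDS() returns the object so you choose its name, while load() restores the original name(s) directly into the environment.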

Your turn! Try to save and load some models.

Gabriel Ristow Cidral / Sara Marin Lopez

11/04/2019